feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780
feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780WZhuo wants to merge 2 commits into
Conversation
Add ArrowRowBuilder (inspect/row_builder_internal) to materialize in-memory rows into an ArrowArray for an arbitrary Iceberg schema, with typed append helpers (AppendNull/Boolean/Int/String/StringMap) reused by later metadata tables.
|
I think this is a good point to factor out a small internal nanoarrow builder/helper layer instead of adding another local copy of the same lifecycle code. We already have the same pattern in a few production paths: initialize from schema, start appending, append child values, finish each row/list/map element, then finish building. For example, A concrete way to keep this scoped would be:
That should fix the leak path here and avoid repeating this fragile nanoarrow lifecycle in the next metadata-table PRs without turning this into a broad refactor. |
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix) - Move files from inspect/ to core level; rename to arrow_row_builder* - Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class - Add Make(const ArrowSchema*) overload for low-level callers - Add ArrowArrayGuard::Release() and guard InitFromSchema in Make() - Move test to arrow_row_builder_test.cc in data_test target
|
Thanks for the thorough review. All three points addressed:
|
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix) - Move files from inspect/ to core level; rename to arrow_row_builder* - Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class - Add Make(const ArrowSchema*) overload for low-level callers - Add ArrowArrayGuard::Release() and guard InitFromSchema in Make() - Move test to arrow_row_builder_test.cc in data_test target
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix) - Move files from inspect/ to core level; rename to arrow_row_builder* - Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class - Add Make(const ArrowSchema*) overload for low-level callers - Add ArrowArrayGuard::Release() and guard InitFromSchema in Make() - Move test to arrow_row_builder_test.cc in data_test target
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix) - Move files from inspect/ to core level; rename to arrow_row_builder* - Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class - Add Make(const ArrowSchema*) overload for low-level callers - Add ArrowArrayGuard::Release() and guard InitFromSchema in Make() - Move test to arrow_row_builder_test.cc in data_test target
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix) - Move files from inspect/ to core level; rename to arrow_row_builder* - Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class - Add Make(const ArrowSchema*) overload for low-level callers - Add ArrowArrayGuard::Release() and guard InitFromSchema in Make() - Move test to arrow_row_builder_test.cc in data_test target
Summary
Adds
ArrowRowBuilder(arrow_row_builder_internal.h/arrow_row_builder.cc), a schema-driven RAII helper that materializes in-memory rows into an ArrowArrowArray(a struct batch) for an arbitrary Iceberg schema. It wraps the nanoarrow boilerplate and exposes per-column access plus typed append free functions, so metadata tables (snapshots, history, manifests, …) can emit rows without re-implementing it.This is the first of a series splitting metadata-table support into focused PRs; the
InMemoryBatchReaderand theSnapshotsTable::Scanintegration are intended to follow in separate PRs that build on this.What's included
ArrowRowBuilder— a single RAII class (move-only) withMake(const Schema&)andMake(const ArrowSchema*)overloads. Handles the full nanoarrow lifecycle:InitFromSchema→StartAppending→ … append values … →FinishBuilding→Release. TheArrowArrayis guarded immediately afterInitFromSchemaso a failure inStartAppendingreleases it automatically.ArrowArrayGuard::Release()— added to the existing guard so other call sites (position_delete_writer,manifest_adapter) can reuse the RAII-release pattern instead of manually managing nanoarrow resources.icebergnamespace:AppendNull,AppendBoolean,AppendInt(covers int32/int64/timestamp via nanoarrow's int64),AppendString,AppendStringMap.iceberglibrary level — it only needs nanoarrow +ToArrowSchema(no Apache Arrow), matching peers likemanifest_adapterandarrow_c_data_util.arrow_row_builder_test.cccovering typed appends (int32/string/int64/boolean/map), null handling for optional columns, multi-entry/empty string maps, zero-row batches, and column-index bounds. Compiled into theiceberg-data-testtest target.Testing
cmake --build build --target iceberg-data-testthen ran the test binary — 5/5ArrowRowBuilderTesttests pass.ctestgreen.arrow::ImportRecordBatch), so its target is underUSE_BUNDLE.Notes